Classifying Illegal Activities on Tor Network Based on Web Textual Contents

نویسندگان

Mhd Wesam Al Nabki

Eduardo Fidalgo

Enrique Alegre

Ivan de Paz

چکیده

The freedom of the Deep Web offers a safe place where people can express themselves anonymously but they also can conduct illegal activities. In this paper, we present and make publicly available1 a new dataset for Darknet active domains, which we call it ”Darknet Usage Text Addresses” (DUTA). We built DUTA by sampling the Tor network during two months and manually labeled each address into 26 classes. Using DUTA, we conducted a comparison between two well-known text representation techniques crossed by three different supervised classifiers to categorize the Tor hidden services. We also fixed the pipeline elements and identified the aspects that have a critical influence on the classification results. We found that the combination of TF-IDF words representation with Logistic Regression classifier achieves 96.6% of 10 folds cross-validation accuracy and a macro F1 score of 93.7% when classifying a subset of illegal activities from DUTA. The good performance of the classifier might support potential tools to help the authorities in the detection of these activities.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improving Tor security against timing and traffic analysis attacks with fair randomization

The Tor network is probably one of the most popular online anonymity systems in the world. It has been built based on the volunteer relays from all around the world. It has a strong scientific basis which is structured very well to work in low latency mode that makes it suitable for tasks such as web browsing. Despite the advantages, the low latency also makes Tor insecure against timing and tr...

متن کامل

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text Classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents in predefined categories based on documents’ contents and labeled-training samples. Since word detection is a difficult and time consuming task in Persian language, Bayesian text classifier is an appropriate approach to deal with different...

متن کامل

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

متن کامل

Leaving Timing Channel Fingerprints in Hidden Service Log Files

Hidden services are anonymously hosted services that can be accessed over an anonymity network, such as Tor. While most hidden services are legitimate, some host illegal content. There has been a fair amount of research on locating hidden services, but an open problem is to develop a general method to prove that a physical machine, once confiscated, was in fact the machine that had been hosting...

متن کامل

Linguagrid: a network of Linguistic and Semantic Services for the Italian Language

In order to handle the increasing amount of textual information today available on the web and exploit the knowledge latent in this mass of unstructured data, a wide variety of linguistic knowledge and resources (Language Identification, Morphological Analysis, Entity Extraction, etc.). is crucial. In the last decade LRaas (Language Resource as a Service) emerged as a novel paradigm for publish...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2017

Classifying Illegal Activities on Tor Network Based on Web Textual Contents

نویسندگان

چکیده

منابع مشابه

Improving Tor security against timing and traffic analysis attacks with fair randomization

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Leaving Timing Channel Fingerprints in Hidden Service Log Files

Linguagrid: a network of Linguistic and Semantic Services for the Italian Language

عنوان ژورنال:

اشتراک گذاری